Compact Universal k-mer Hitting Sets

Authors

  • Yaron Orenstein
  • David Pellow
  • Guillaume Marçais
  • Ron Shamir
  • Carl Kingsford

Abstract

We address the problem of finding a minimum-size set of k-mers that hits L-long sequences. The problem arises in the design of compact hash functions and other data structures for efficient handling of large sequencing datasets. We prove that the problem of hitting a given set of L-long sequences is NP-hard and give a heuristic solution that finds a compact universal k-mer set that hits any set of L-long sequences. The algorithm, called DOCKS (design of compact k-mer sets), works in two phases: (i) finding a minimum-size k-mer set that hits every infinite sequence; (ii) greedily adding k-mers such that together they hit all remaining L-long sequences. We show that DOCKS works well in practice and produces a set of k-mers that is much smaller than a random choice of k-mers. We present results for various values of k and sequence length L and, by applying the resulting sets to two bacterial genomes, show that universal hitting k-mers improve on minimizers. The software and example sets are freely available at acgt.cs.tau.ac.il/docks/.
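The notion of a universal hitting set can be illustrated on a toy scale. The following Python sketch is not the DOCKS implementation: it skips phase (i) entirely and applies only a naive greedy selection over a brute-force enumeration of all L-long sequences, which is feasible only for very small k and L. The function names `kmers_of`, `is_universal_hitting_set`, and `greedy_hitting_set` are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # DNA alphabet; an assumption made for this illustration


def kmers_of(seq, k):
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_universal_hitting_set(kmer_set, k, L, alphabet=ALPHABET):
    """Check by brute force that every L-long sequence contains
    at least one k-mer from kmer_set."""
    return all(
        not kmers_of("".join(s), k).isdisjoint(kmer_set)
        for s in product(alphabet, repeat=L)
    )


def greedy_hitting_set(k, L, alphabet=ALPHABET):
    """Naive greedy heuristic (not DOCKS): repeatedly pick the k-mer
    that occurs in the largest number of still-unhit L-long sequences."""
    # Each sequence is represented only by the set of k-mers it contains.
    unhit = [kmers_of("".join(s), k) for s in product(alphabet, repeat=L)]
    chosen = set()
    while unhit:
        counts = Counter(km for ks in unhit for km in ks)
        # sorted() makes tie-breaking deterministic (lexicographic order).
        best = max(sorted(counts), key=counts.get)
        chosen.add(best)
        unhit = [ks for ks in unhit if best not in ks]
    return chosen
```

For example, `greedy_hitting_set(2, 4)` returns a set of 2-mers that `is_universal_hitting_set` accepts for k = 2 and L = 4. The enumeration grows as 4^L, so this sketch only demonstrates the definition and does not scale to the parameter ranges discussed in the abstract.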

Similar resources

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative ...

Improving the performance of minimizers and winnowing schemes

Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. to...

A convex combinatorial property of compact sets in the plane and its roots in lattice theory

K. Adaricheva and M. Bolat have recently proved that if $\mathcal U_0$ and $\mathcal U_1$ are circles in a triangle with vertices $A_0,A_1,A_2$, then there exist $j\in\{0,1,2\}$ and $k\in\{0,1\}$ such that $\mathcal U_{1-k}$ is included in the convex hull of $\mathcal U_k\cup(\{A_0,A_1,A_2\}\setminus\{A_j\})$. One could say disks instead of circles. Here we prove the existence of such a $j$ and $k$ ...

Aid Effectiveness in the Sustainable Development Goals Era; Comment on ““It’s About the Idea Hitting the Bull’s Eye”: How Aid Effectiveness Can Catalyse the Scale-up of Health Innovations”

Over just a six-year period from 2005-2011, five aid effectiveness initiatives were launched: the Paris Declaration on Aid Effectiveness (2005), the International Health Partnership plus (2007), the Accra Agenda for Action (2008), the Busan Partnership for Effective Cooperation (2011), and the Global Partnership for Effective Development Cooperation (GPEDC) (2011). More recently, in 2015, the A...

Universal Approximation of Interval-valued Fuzzy Systems Based on Interval-valued Implications

It is first proved that the multi-input-single-output (MISO) fuzzy systems based on interval-valued $R$- and $S$-implications can approximate any continuous function defined on a compact set to arbitrary accuracy. A formula to compute the lower upper bounds on the number of interval-valued fuzzy sets needed to achieve a pre-specified approximation accuracy for an arbitrary multivariate con...

Publication date: 2016